The tidyverse is a collection of packages for doing data science:
It includes a website: http://tidyverse.org and a book: http://r4ds.had.co.nz
At the heart of the tidyverse is tidy data. Every variable is a column, every row is a case.
Base R - we could use built-in functions like aggregate(), by(), ave()
Data are here for this example:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
Adaption of examples by nightrose Data Scientist AIG New York
Read data via readr
# getwd()
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(readr)
bankData <- read_delim("data/bank-full.csv", ";")
Parsed with column specification:
cols(
age = col_integer(),
job = col_character(),
marital = col_character(),
education = col_character(),
default = col_character(),
balance = col_integer(),
housing = col_character(),
loan = col_character(),
contact = col_character(),
day = col_integer(),
month = col_character(),
duration = col_integer(),
campaign = col_integer(),
pdays = col_integer(),
previous = col_integer(),
poutcome = col_character(),
y = col_character()
)
bankData
# let us explore the data set a bit
View(bankData)# allows us to view the data set
names(bankData) # names of the variables
[1] "age" "job" "marital" "education" "default"
[6] "balance" "housing" "loan" "contact" "day"
[11] "month" "duration" "campaign" "pdays" "previous"
[16] "poutcome" "y"
dim(bankData) # dimension (number of rows and columns)
[1] 45211 17
str(bankData) # structure of the data set
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 45211 obs. of 17 variables:
$ age : int 58 44 33 47 33 35 28 42 58 43 ...
$ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
$ marital : chr "married" "single" "married" "married" ...
$ education: chr "tertiary" "secondary" "secondary" "unknown" ...
$ default : chr "no" "no" "no" "no" ...
$ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
$ housing : chr "yes" "yes" "yes" "yes" ...
$ loan : chr "no" "no" "yes" "no" ...
$ contact : chr "unknown" "unknown" "unknown" "unknown" ...
$ day : int 5 5 5 5 5 5 5 5 5 5 ...
$ month : chr "may" "may" "may" "may" ...
$ duration : int 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
$ y : chr "no" "no" "no" "no" ...
- attr(*, "spec")=List of 2
..$ cols :List of 17
.. ..$ age : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ job : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ marital : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ education: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ default : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ balance : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ housing : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ loan : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ contact : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ day : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ month : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ duration : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ campaign : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ pdays : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ previous : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ poutcome : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ y : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
class(bankData)
[1] "tbl_df" "tbl" "data.frame"
head(bankData, n = 5)
tail(bankData, n = 5)
summary(bankData)
age job marital
Min. :18.00 Length:45211 Length:45211
1st Qu.:33.00 Class :character Class :character
Median :39.00 Mode :character Mode :character
Mean :40.94
3rd Qu.:48.00
Max. :95.00
education default balance
Length:45211 Length:45211 Min. : -8019
Class :character Class :character 1st Qu.: 72
Mode :character Mode :character Median : 448
Mean : 1362
3rd Qu.: 1428
Max. :102127
housing loan contact
Length:45211 Length:45211 Length:45211
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
day month duration
Min. : 1.00 Length:45211 Min. : 0.0
1st Qu.: 8.00 Class :character 1st Qu.: 103.0
Median :16.00 Mode :character Median : 180.0
Mean :15.81 Mean : 258.2
3rd Qu.:21.00 3rd Qu.: 319.0
Max. :31.00 Max. :4918.0
campaign pdays previous
Min. : 1.000 Min. : -1.0 Min. : 0.0000
1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
Median : 2.000 Median : -1.0 Median : 0.0000
Mean : 2.764 Mean : 40.2 Mean : 0.5803
3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
Max. :63.000 Max. :871.0 Max. :275.0000
poutcome y
Length:45211 Length:45211
Class :character Class :character
Mode :character Mode :character
hist(bankData$age,
main = "Histogram of Age",
xlab = "Age in Years")
boxplot(bankData$age,
main = toupper("Boxplot of Age"),
ylab = "Age in years",
col = "blue")
d <- density(bankData$age)
plot(d, main = "Kernel density of Age")
polygon(d, col = "red", border = "blue")
Seven dplyr verbs & description fundamental functions of data transformation select() select columns/variables filter() filter rows / provides basic filtering capabilities arrange() re-order or arrange rows /ordering data mutate() create new columns/variables summarise() summarise values/data by functions of choice group_by() allows for group operations in the “split-apply-combine” concept / groups data by categorical levels join() joining separate dataframes
The filter verb takes conditions for filtering rows based on conditions
Subsetting Example 1 Filter
filter(bankData, default == 'yes')
Subsetting Example 2 Filter
filter(bankData, balance < 1000)
Subsetting Example 3
filter(bankData, month %in% c("april", "may", "jun"),
default == "yes")
You can also extract particular rows by number using slice().
slice(bankData, 5:10)
You can use the select() verb to specify which columns of a dataset you want
Select Example 1
select(bankData, age, job, default, balance, housing)
Select Example 2
select(bankData, default:duration, contains("p"))
Rename verb to easily rename variables Select Example 3
select(bankData, bought_option=y)
Rename Example
rename(bankData, bought_option=y)
You can reorder your dataset based on conditions using the arrange() verb
Arrange Example 1
arrange(bankData, job, default)
Arrange Example 2
arrange(bankData, balance, default)
Arrange Example 3 You can use desc() to sort in descending order.
arrange(bankData, desc(balance), default)
Transformations
The mutate() verb can be used to make new columns
mutate(bankData, "DefaultFlag" = ifelse(default == 'yes', 1, 0))
Transformations 2
mutate(bankData, "BalanceByDuration" = balance/duration)
mutate() retains all columns. If you only want to keep the new transforms, you can use transmute()
Transmute Example
transmute(bankData, "BalanceByDuration" = balance/duration)
Summarise Data by Groups The group_by verb creates a grouping by a categorical variable
args(group_by)
function (.data, ..., add = FALSE)
NULL
Example group_by 1
summarise(group_by(bankData, default), Num = n())
Example group_by 2
summarise(group_by(bankData, default), Ave.Balance = mean(balance))
Example group_by 3
summarise(group_by(bankData, default), Ave.Balance = mean(balance), Num = n())
Chaining/Piping The pipe operator %>% Passes result on left into first argument of function on right. Piping is not restricted to dplyr manipulation tasks
Take the hflights data set and then…Add a variable named diff that is the result of subtracting TaxiIn from TaxiOut, and then… Pick all of the rows whose diff value does not equal NA, and then… Summarise the data set with a value named avg that is the mean diff value. hflights %>% mutate(diff=(TaxiIn-TaxiOut)) %>% filter(is.na(diff)) %>% summarise(avg=mean(diff))
Standard
arrange(filter(select(bankData, age, job, education, default), default == 'yes'), job, education, age)
With Pipes
arrange(
filter(
select(bankData, age, job, education, default),
default == 'yes'),
job, education, age)
Piping example The pipe operator is very helpful for group by summaries
bankData %>%
select(age, job, education, default) %>%
filter(default == 'yes') %>%
arrange(job, education, age)
No Pipes
x1 <- rnorm(10)
x2 <- rnorm(10)
sqrt(sum((x1 - x2)^2))
[1] 3.844407
With Pipes
(x1 - x2)^2 %>% sum() %>% sqrt()
[1] 3.844407
Pipe + group_by() The pipe operator is very helpful for group by summaries
bankData %>% group_by(job) %>%
summarise(Number = n(),
Average.Balance = mean(balance),
Number.Defaulted = sum(default == 'yes'),
Default.Rate = Number.Defaulted/Number)
Pipe and Ploting
library(ggplot2)
bankData %>%
filter(job %in% c("management", "technician", "unemployed")) %>%
group_by(job, marital) %>%
summarise(Counts = n() ) %>%
ggplot() +
geom_bar(aes(x = job, y = Counts, fill = marital),
stat = 'identity', position = 'dodge')
Piping: Unique Values Piping is also very helpful with identifying unique rows. You can also use distinct() to identify unique rows and is typically used with arrange().
bankData %>%
select(job, marital, education, default, housing, loan, contact) %>%
arrange(job, marital, education, default, housing, loan, contact) %>%
distinct()
Unique Keys You can specify variables that you only want unique values for.
bankData %>%
select(job, marital, education, default, housing, loan, contact) %>%
arrange(job, marital, education, default, housing, loan, contact) %>%
distinct(job, marital, education)
Unique Keys It will keep the first row with those particular key values.
bankData %>%
select(job, marital, education, default, housing, loan, contact) %>%
arrange(job, marital, education, desc(default), desc(housing), desc(loan), desc(contact)) %>%
distinct(job, marital, education)
Multiple Columns You can summarise or mutate multiple columns using the same grouping variable. summarise_each allows you to apply the same summary function to multiple columns mutate_each also does a similar manipulation for mutate
help(summarise_each)
Summarise_each Example
bankData %>%
group_by(education) %>%
summarise_each(funs(mean), balance, duration)
summarise_each Example 2 You can also use multiple functions.
bankData %>%
group_by(education) %>%
summarise_each(funs(min, mean, max), balance, duration)
mutate_each Example You can use the . to indicate where the variables go in an arbitrary function.
bankData %>%
group_by(month) %>%
select(balance, duration) %>%
mutate_each(funs(half = ./2))
Adding missing grouping variables: `month`
Additional Helper Functions Helper functions n() and count() count the number of rows in a group Helper function n_distinct(vector) counts the number of unique items in that vector
bankData %>%
group_by(job, default) %>%
summarise(education_levels = n_distinct(education))
tally tally() is a shortcut for counting
bankData %>% group_by(job) %>%
tally()
Without tally()
bankData %>% group_by(job) %>%
summarise(n = n())
count count() makes it even easier.
bankData %>% count(job)
Ranking Variables In base R, you can use rank.
args(rank)
function (x, na.last = TRUE, ties.method = c("average", "first",
"random", "max", "min"))
NULL
Rank Examples
bankData %>% slice(1:10) %>%
transmute(Job = job,
jobRankAvg = rank(job),
jobRankRow = row_number(job),
jobRankMin = min_rank(job),
jobRankDense = dense_rank(job),
jobRankPerc = percent_rank(job),
jobRankCume = cume_dist(job))
Applying custom functions You can also apply your own custom functions using do()
set.seed(1)
df <- data.frame(
houseID = rep(1:10, each = 10),
year = 1995:2004,
price = ifelse(runif(10 * 10) > 0.50, NA, exp(rnorm(10 * 10)))
)
head(df)
Grouped Tests
bankData %>%
filter(marital %in% c('married', 'single')) %>%
group_by(job) %>%
do(tTest = t.test(age ~ marital, data = .)) %>%
mutate("tTestPVal" = get("p.value",tTest), "tTestStat" = get("statistic", tTest))
Changing gears… Let’s take a look at a new dataset:
Data are from https://stat.duke.edu/~mc301/data/hdi.csv
The Human Development Index (HDI) is a composite statistic of life expectancy, education, and per capita income indicators, which are used to rank countries into four tiers of human development.
Adaption of example by Mine Cetinkaya-Rundel
tidyr
A package that reshapes the layout of tabular data.
http://vita.had.co.nz/papers/tidy-data.html
tidyr Operations
There are four fundamental functions of data tidying:
gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer spread() takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider separate() splits a single column into multiple columns unite() combines multiple columns into a single column
hdi <- read_csv("https://stat.duke.edu/~mc301/data/hdi.csv")
View(hdi)
Wide to long data with gather
library(tidyr)
library(stringr)
hdi_long <- gather(hdi, key = year, value = hd_index, hdi_1980:hdi_2011)
View(hdi_long)
Let’s do a little bit better…
hdi_long2 <- hdi_long %>%
mutate(year = as.numeric(str_replace(year, "hdi_", "")))
View(hdi_long2)
Long to wide data with spread
hdi_wide <- spread(hdi_long2, key = year, value = hd_index)
View(hdi_wide)
This is just a small peak at the functions and power of the tidyverse.